|
Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations and locations, expressions of times, quantities, monetary values and percentages.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

:Jim bought 300 shares of Acme Corp. in 2006.

and producing an annotated block of text that highlights the names of entities:

:[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.〔Elaine Marsh, Dennis Perzanowski, "MUC-7 Evaluation of IE Technology: Overview of Results", 29 April 1998 (PDF)〕〔MUC-07 Proceedings (Named Entity Tasks)〕

==Problem definition==

In the expression ''named entity'', the word ''named'' restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the ''automotive company created by Henry Ford in 1903'' is referred to as ''Ford'' or ''Ford Motor Company''. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.

Full named-entity recognition is often broken down, conceptually and possibly also in implementations, into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location and other).
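The annotated example and the split into detection and classification can be sketched as typed token spans. This is only an illustration under stated assumptions: the tokenization, the `Entity` class and the `annotate` helper are hypothetical names that do not come from any particular NER system, and the spans and labels are written by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span
    label: str  # entity type, e.g. "Person"


def annotate(tokens, entities):
    """Render tokens with each entity span wrapped as [text]Label."""
    by_start = {e.start: e for e in entities}
    out = []
    i = 0
    while i < len(tokens):
        e = by_start.get(i)
        if e is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append("[" + " ".join(tokens[e.start:e.end]) + "]" + e.label)
            i = e.end  # skip past the whole span
    return " ".join(out)


# Assumed whitespace-style tokenization of the example sentence.
tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]

# Problem 1, detection: which contiguous token spans are names.
spans = [(0, 1), (5, 7), (8, 9)]

# Problem 2, classification: one type per detected span.
labels = ["Person", "Organization", "Time"]

entities = [Entity(s, e, lab) for (s, e), lab in zip(spans, labels)]
print(annotate(tokens, entities))
# [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time .
```

Keeping the spans and the labels in separate lists mirrors the conceptual split; a real system may well predict both jointly.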
The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking.

Temporal expressions and some numerical expressions (e.g., money and percentages) may also be considered named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year ''2001''), many others are not (e.g., ''June'' in "I take my vacations in June"). In the first case, the year ''2001'' refers to the ''2001st year of the Gregorian calendar''. In the second case, the month ''June'' may refer to the month of an undefined year (''past June'', ''next June'', ''June 2020'', etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term ''named entity'' is therefore not strict and often has to be explained in the context in which it is used.〔Named Entity Definition. Webknox.com. Retrieved on 2013-07-21.〕

Certain hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes. Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.〔Sekine's Extended Named Entity Hierarchy. Nlp.cs.nyu.edu. Retrieved on 2013-07-21.〕 More recently, in 2011, Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.

Source: Wikipedia, the free encyclopedia.
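The segmentation view described above, in which names are contiguous, non-nested token spans, is commonly encoded as one tag per token. The sketch below assumes the BIO convention (`B-` begins a name, `I-` continues it, `O` is outside any name), a scheme widely used for chunking. The text does not say which encoding any given system uses, and the function names `spans_to_bio` and `bio_to_spans` are hypothetical.

```python
def spans_to_bio(n_tokens, spans):
    """Encode non-overlapping (start, end, label) spans as BIO tags.

    `end` is exclusive. Nested names cannot be represented in this scheme.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A stray I- tag with no open span is treated as starting a new one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


tokens = ["He", "works", "at", "Bank", "of", "America", "."]
# "Bank of America" is one flat name; the inner name "America" is ignored.
spans = [(3, 6, "Organization")]

tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Because each token carries exactly one tag, this encoding enforces the no-nesting simplification by construction: there is no way to mark "America" as a location while it is also part of the organization name.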
|